Introduction

Airbnb has not only changed how people travel and live, but has also created new business opportunities. We are interested in exploring the data generated by Airbnb and analyzing it to find interesting facts about Airbnb listings in New York. We hope to generate useful insights that offer guidance for customers and business suggestions for hosts.

Get and Prepare Data

Data source: http://insideairbnb.com/get-the-data.html.

The data source provides datasets of Airbnb listing information. We use the most recent (September 2019) dataset for New York. The data is not cleaned, so we need to spend some time tidying it.

The first obstacle is that the data file is relatively large: the downloaded CSV file is more than 180 MB. We therefore select only the columns relevant to our analysis. We also exclude listings without reviews, as we want to focus on listings that have been reviewed.

Another obstacle is that some information is compressed into a single cell, such as amenities. The amenities column packs all the amenities provided into one long string, and the items are not separated by a simple delimiter. We have tried to transform amenities into a separate tidy data frame, and we will try to clean other similar columns.

Additionally, we define an active listing as one that received a review within the past 12 months. We also keep only listings with a price greater than zero, since a price of zero is likely due to an error in data collection.

For the cleaning fee (cleaning_fee) and security deposit (security_deposit), it is intuitive to replace missing values with 0. A small number of values are missing in other variables; we simply exclude those listings, since the missing values may also be due to collection errors in the original data.

Finally, there are 28098 listings and 62 variables for our analysis. The variables include

Amenities for backup

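#split amenities into tidy data frame
library(tidyverse) #stringr, dplyr, tidyr
##function to modify amenities string
split_amenities<-function(x){
  #x is an input string: items wrapped in {} and separated by commas, some of them quoted
  x<-str_replace_all(x,"[{}]",",") #replace the enclosing {} with commas
  temp<-str_split(x,"\\\"")[[1]] #split on double quotes
  #pieces that start or end with a comma hold unquoted items: turn their commas into |
  is_unquoted<-str_starts(temp,"[,]") | str_ends(temp,"[,]")
  temp[is_unquoted]<-str_replace_all(temp[is_unquoted],"[,]","|")
  temp<-str_remove(temp,"^\\|") #drop a leading |
  temp<-str_remove(temp,"\\|$") #drop a trailing |
  temp<-temp[temp!=""] #drop empty pieces
  out<-paste(temp,collapse = "|")
  return(out)
}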

#38 listings have no amenities
dat2<-dat1%>%select(id, amenities)%>%
  rowwise()%>%
  mutate(amenity = split_amenities(amenities))%>%
  ungroup() #drop the rowwise grouping before reshaping
dat3<-dat2%>%separate_rows(amenity,sep="\\|")

##output amenities
#write_csv(dat3%>%select(-amenities),"./data/amenities_201909.csv")

Analysis

The majority of listings in our data are in Manhattan and Brooklyn. The closer to downtown New York City, the denser the listings.

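## map for listings
library(ggmap) #get_map() with source = "google" needs an API key registered via register_google()
data <- airbnb_cleaned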
data_map_all <- data %>% dplyr::select(id, longitude, latitude)
sbbox <- make_bbox(lon = data_map_all$longitude, lat = data_map_all$latitude, f = .001)

ny_map_all <- get_map(location = sbbox, maptype = "satellite", source = "google")
ggmap(ny_map_all) + 
  geom_point(data = data_map_all, mapping = aes(x = longitude, y = latitude), color = "red", size = 0.0011, alpha = 0.6)

Price

Booking price is always an important factor that both customers and hosts care about. In this section, we explore Airbnb booking prices in the New York market.

How’s Booking Price Distributed?

The first plot shows the overall distribution of price, which is right-skewed with a very long tail. We therefore apply a log10 transformation to get a better visualization. The distribution shows that the median price is $100. The boxplot that follows also shows that most listings have a price under $200.
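The plotting code for this step is not shown here, so the following is only a minimal sketch of how the log10 view and the boxplot described above could be drawn with ggplot2. The data frame `toy` is a made-up stand-in for `airbnb_cleaned`; its values are simulated and are not our results.

```r
library(ggplot2)

# Hypothetical stand-in for airbnb_cleaned: simulated right-skewed prices
set.seed(1)
toy <- data.frame(price = round(exp(rnorm(500, mean = log(100), sd = 0.7))))
toy <- toy[toy$price > 0, , drop = FALSE]  # a log scale needs positive prices

# Histogram on a log10 x-axis, with the median marked by a dashed line
p_hist <- ggplot(toy, aes(x = price)) +
  geom_histogram(bins = 40) +
  scale_x_log10() +
  geom_vline(xintercept = median(toy$price), linetype = "dashed") +
  ggtitle("Distribution of price (log10 scale)")

# Boxplot of the same variable, also on a log10 axis
p_box <- ggplot(toy, aes(x = "", y = price)) +
  geom_boxplot() +
  scale_y_log10() +
  xlab(NULL)

print(p_hist)
print(p_box)
```

Using `scale_x_log10()` keeps the axis labels in dollars, which is easier to read than plotting `log10(price)` directly.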

What Makes Some Listings Most Expensive?

Some listings have ridiculously high prices. What are these listings? Why are they so expensive? Let’s explore these high-priced listings further.

We focus on the listings priced above $600. Most of them are in Manhattan, which is reasonable. Furthermore, most of the high-priced listings are in Midtown, which is also not unexpected, because Midtown is the central portion of Manhattan.

Moreover, the room type of most of these high-priced listings is entire home/apt, and the number of bedrooms is about 2-4. This may explain the high prices: it is reasonable for a large apartment in Midtown Manhattan to be expensive.
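As a rough sketch of how the breakdown above could be computed, the following filters listings priced above $600 and counts them by borough, room type, and number of bedrooms. The column names follow our cleaned data, but `toy` is a hypothetical five-row example whose values are purely illustrative.

```r
library(dplyr)

# Hypothetical stand-in for airbnb_cleaned (illustrative values only)
toy <- data.frame(
  neighbourhood_group_cleansed = c("Manhattan", "Manhattan", "Brooklyn", "Manhattan", "Queens"),
  neighbourhood_cleansed = c("Midtown", "Midtown", "Williamsburg", "Tribeca", "Astoria"),
  room_type = c("Entire home/apt", "Entire home/apt", "Entire home/apt", "Private room", "Private room"),
  bedrooms = c(3, 2, 4, 1, 1),
  price = c(900, 750, 650, 300, 80)
)

# Keep only the high-priced listings
expensive <- toy %>% filter(price > 600)

# Count them by borough, room type, and number of bedrooms
by_borough  <- expensive %>% count(neighbourhood_group_cleansed, sort = TRUE)
by_room     <- expensive %>% count(room_type, sort = TRUE)
by_bedrooms <- expensive %>% count(bedrooms)

print(by_borough)
print(by_room)
print(by_bedrooms)
```

Replacing `toy` with `airbnb_cleaned` would give the real counts; the same `count()` call on `neighbourhood_cleansed` would show which neighbourhoods dominate.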

Does Location Make a Difference?

Based on the price distribution above, we are interested in how price varies with location. We plotted the pushpin map below, which shows the price distribution across New York City.

## map for different price
ny_price <- data %>% mutate(price_group = 
         case_when(price <= 100 ~ '<=$100', 
                   price <= 250 ~ '$100-$250',
                   price <= 500 ~ '$250-$500',
                   TRUE ~ '>$500')) 
ny_price$price_group <- factor(ny_price$price_group, levels = c('<=$100',  '$100-$250', '$250-$500', '>$500'))

data_map_price <- ny_price %>% select(id, longitude, latitude, price_group)
sbbox3 <- make_bbox(lon = data_map_price$longitude, lat = data_map_price$latitude, f = .001)

ny_map_price <- get_map(location = sbbox3, maptype = "satellite", source = "google")
ggmap(ny_map_price) + 
  geom_point(data = data_map_price, mapping = aes(x = longitude, y = latitude, color = price_group), size = 1, alpha = 0.5) +
  scale_color_brewer(palette = "Dark2")

Most of the listings are in the $100-$250 per night range, and the closer a listing is to downtown New York, the higher the price. The purple points indicate expensive listings above $250 per night. Listings that cost more than $500 per night are rare.
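The map gives a visual impression only. One way to check such a statement numerically is to tabulate the share of listings in each price group. The sketch below does this on a small made-up price vector (`toy` is hypothetical and is not our data), using `cut()` with the same break points as the map.

```r
library(dplyr)

# Hypothetical prices, for illustration only
toy <- data.frame(price = c(60, 85, 100, 120, 180, 240, 300, 450, 700, 95))

shares <- toy %>%
  # Intervals are closed on the right: (0,100], (100,250], (250,500], (500,Inf)
  mutate(price_group = cut(price,
                           breaks = c(0, 100, 250, 500, Inf),
                           labels = c("<=$100", "$100-$250", "$250-$500", ">$500"))) %>%
  count(price_group) %>%
  mutate(share = n / sum(n))

print(shares)
```

Running the same pipeline on `airbnb_cleaned` would show how many listings actually fall into each group.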

Indeed, price is largely affected by location. To illustrate this better, we plotted the average price per night for the five boroughs: the Bronx, Staten Island, and Queens are almost the same, while Brooklyn and Manhattan are more expensive by about $50 to $100 per night, which is a large amount considering that the majority of prices are around $100 per night.

#average price for different area
avg_airbnb <- airbnb_cleaned %>%
  group_by(neighbourhood_group_cleansed) %>%
  summarize(avg_price = mean(price)) %>%
  arrange(desc(avg_price))
colnames(avg_airbnb) <- c("neighbourhood","avg_price" )

  
ggplot(avg_airbnb) + 
  geom_bar(aes(x = reorder(neighbourhood, avg_price), y = avg_price),stat="identity") +
  xlab("Neighbourhood") +
  ggtitle("Average price for different area ")

#Price distribution of different neighborhood 
ggplot(airbnb_cleaned, aes(x= price, color = neighbourhood_group_cleansed)) + geom_density() +
  ggtitle("Price distribution in different neighbourhoods")

Cleaning fee

ggplot(data = airbnb_cleaned, aes(x = cleaning_fee)) + geom_histogram(binwidth = 2)+
  ggtitle("Distribution of cleaning fees")

Do Room Type and Property Type Affect Price?

#Price distribution of different room type
ggplot(airbnb_cleaned, aes(x= price, color = room_type)) + geom_density() +
  ggtitle("Price distribution of different room type")

avg_airbnb <- airbnb_cleaned %>%
  group_by(neighbourhood_group_cleansed, room_type) %>%
  summarize(avg_price = mean(price)) %>%
  arrange(neighbourhood_group_cleansed) 
 

ggplot(avg_airbnb, aes(x = reorder(neighbourhood_group_cleansed, avg_price), y = avg_price, fill = room_type)) + 
  geom_bar(stat="identity",position = "dodge") + ggtitle("Average price for different neighborhoods with room type")+
  xlab("neighbourhood")

Surprisingly, in an area as expensive as Manhattan, many listings are “Entire home/apt”. Manhattan does have the most expensive listings, but they are also of relatively good quality (rather than being all shared rooms and small private rooms).

## map for different room types
data_map_room <- data[sample(nrow(data)),] %>% select(id, longitude, latitude, room_type)
sbbox1 <- make_bbox(lon = data_map_room$longitude, lat = data_map_room$latitude, f = .001)

ny_map_room <- get_map(location = sbbox1, maptype = "satellite", source = "google")
ggmap(ny_map_room) + 
  geom_point(data = data_map_room, mapping = aes(x = longitude, y = latitude, color = room_type), size = 1, alpha = 0.5) + 
  scale_color_brewer(palette = "Dark2") #points are mapped to color, not fill

## map for different property type
data_map_property <- data %>% select(id, longitude, latitude, property_type)
sbbox2 <- make_bbox(lon = data_map_property$longitude, lat = data_map_property$latitude, f = .001)

ny_map_property <- get_map(location = sbbox2, maptype = "satellite", source = "google")
ggmap(ny_map_property) + 
  geom_point(data = data_map_property, mapping = aes(x = longitude, y = latitude, color = property_type), size = 1, alpha = 0.5)

#Top average price 
avg_airbnb <- airbnb_cleaned %>%
  group_by(neighbourhood_cleansed) %>%
  summarize(avg_price = mean(price)) %>%
  arrange(desc(avg_price))
colnames(avg_airbnb) <- c("neighbourhood","avg_price" )

  
ggplot(avg_airbnb[1:20,]) + 
  geom_bar(aes(x = reorder(neighbourhood, avg_price), y = avg_price),stat="identity") +
  coord_flip() + ggtitle("Top 20 average price ")

#Top average price(different room type)
neigh_list <- dplyr::pull(avg_airbnb[1:10,1])

unique(airbnb_cleaned$room_type)
## [1] "Entire home/apt" "Private room"    "Shared room"     "Hotel room"
temp <- airbnb_cleaned %>%
  group_by(neighbourhood_cleansed, room_type) %>%
  summarize(avg_price = mean(price)) 

temp %>% filter(room_type == "Private room") %>% arrange(desc(avg_price))
## # A tibble: 202 x 3
## # Groups:   neighbourhood_cleansed [202]
##    neighbourhood_cleansed room_type    avg_price
##    <chr>                  <chr>            <dbl>
##  1 Bay Terrace            Private room      265 
##  2 Breezy Point           Private room      195 
##  3 Belle Harbor           Private room      178.
##  4 Theater District       Private room      175.
##  5 Midtown                Private room      171.
##  6 Tribeca                Private room      151.
##  7 NoHo                   Private room      139.
##  8 West Village           Private room      138.
##  9 Murray Hill            Private room      136 
## 10 SoHo                   Private room      132.
## # … with 192 more rows

Property type

#Distribution of property type
temp <- airbnb_cleaned %>%
  group_by(property_type) %>%
  summarise(counts = n())%>%
  arrange(desc(counts)) %>%
  filter(counts > 5)

ggplot(temp) + 
  geom_bar(aes(x = reorder(property_type, counts),y = counts), stat = 'identity') +  coord_flip() +
             ggtitle("distribution of property type ")

Reviews

Room type

Host

Interactive

Conclusion

Appendix: code

#Data Importing and Cleaning
library(tidyverse) #readr, dplyr, stringr, tidyr
data <- read_csv("listings_201909.csv", na=c("","NA","N/A"))

##select columns
dat<-data%>%
  select(-c(scrape_id:xl_picture_url))%>%
  select(-host_url, -host_name, -host_location, -host_about,
         -host_acceptance_rate, -host_thumbnail_url, -host_picture_url, 
         -host_neighbourhood)%>%
  select(-street,-city,-state, -market, -smart_location, -country, 
         -country_code)%>%
  select(-jurisdiction_names, -license,-weekly_price,-monthly_price, 
         -square_feet)%>%
  select(-c(calendar_updated:calendar_last_scraped)) 
#dim(dat)  #48377    62

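##modify data type
#Note: a value containing a thousands separator (e.g. "$1,200.00") would not parse
#after this pattern and would become NA; adding "," to the character class would
#keep such rows, but would change the row counts noted below.
dat<-dat%>%
  mutate_at(c("host_response_rate","extra_people",
              "price","security_deposit","cleaning_fee"),
            str_remove_all,pattern="[%$]")%>%
  mutate_at(c("host_response_rate","extra_people",
              "price","security_deposit","cleaning_fee"),
            as.numeric)

##select listings which have reviews within the last 12 months
data_cleaned<-dat%>%filter(!is.na(first_review))%>%
  filter(last_review>='2018-09-01')
#dim(data_cleaned) #28105    62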

#clean NA
##replace missing values
data_cleaned$cleaning_fee[is.na(data_cleaned$cleaning_fee)] <- 0
data_cleaned$security_deposit[is.na(data_cleaned$security_deposit)] <- 0
data_cleaned<-data_cleaned%>%filter(price>0)

##exclude missing values
completeFun <- function(data, desiredCols) {
  completeVec <- complete.cases(data[, desiredCols])
  return(data[completeVec, ])
}

data_cleaned <- completeFun(data_cleaned, c("review_scores_value", "review_scores_checkin", "review_scores_accuracy", "review_scores_communication", "review_scores_cleanliness","review_scores_rating","neighbourhood", "review_scores_location", "price", "bedrooms", "beds","bathrooms", "host_identity_verified","zipcode"))

#sort(colSums(is.na(data_cleaned)),decreasing = TRUE)

dim(data_cleaned) #28098

#cleaned data output
##write_csv(data_cleaned,"data_cleaned.csv")